Offensive Rating measures points scored per 100 possessions, providing a pace-adjusted metric of scoring efficiency. This is our primary dependent variable.
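As a quick illustration of the definition, the sketch below computes ORtg from box-score totals. The game totals are made up, and the possession estimate (FGA + 0.44 × FTA − ORB + TOV) is a common approximation assumed here; the source data may rely on a different estimator.

```r
# Hypothetical single-game team totals (illustrative values, not from league_avg)
pts <- 112; fga <- 88; fta <- 22; orb <- 10; tov <- 13

# Assumed possession estimate: a common box-score approximation
poss <- fga + 0.44 * fta - orb + tov

# Offensive Rating: points scored per 100 possessions
ortg <- 100 * pts / poss
round(c(possessions = poss, ORtg = ortg), 1)
```

Because the denominator is possessions rather than minutes or games, a team that plays faster does not automatically earn a higher rating.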
1.1 Time Series Visualization and Component Identification
Code
# Convert to time series object
ts_ortg <- ts(league_avg$ORtg, start = 1980, frequency = 1)

# Create visualization
df_ortg <- data.frame(
  Year = league_avg$Season,
  Value = league_avg$ORtg,
  Era = case_when(
    league_avg$Season < 2012 ~ "Pre-Analytics Era",
    league_avg$Season >= 2012 & league_avg$Season < 2020 ~ "Analytics Era",
    league_avg$Season >= 2020 ~ "Post-COVID Era"
  )
)

ggplot(df_ortg, aes(x = Year, y = Value, color = Era)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  geom_vline(xintercept = 2012, linetype = "dashed", color = "#f58426", size = 1) +
  geom_vline(xintercept = 2020, linetype = "dashed", color = "#bec0c2", size = 1) +
  annotate("text",
    x = 2012, y = 112, label = "Analytics Era\nBegins (2012)",
    hjust = -0.1, color = "#f58426", fontface = "bold", size = 3.5
  ) +
  annotate("text",
    x = 2020, y = 112, label = "COVID-19\n(2020)",
    hjust = 1.1, color = "#bec0c2", fontface = "bold", size = 3.5
  ) +
  scale_color_manual(values = c(
    "Pre-Analytics Era" = "#006bb6",
    "Analytics Era" = "#f58426",
    "Post-COVID Era" = "#bec0c2"
  )) +
  labs(
    title = "NBA Offensive Rating (1980-2025): Evolution of Scoring Efficiency",
    subtitle = "Points per 100 possessions - our primary measure of offensive efficiency",
    x = "Season",
    y = "Offensive Rating (ORtg)",
    color = "Era"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "bottom"
  )
Pace measures possessions per 48 minutes, representing game tempo. Our research question asks: Does pace amplify or mediate the impact of shot selection on efficiency?
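To make the definition concrete, the following sketch rescales a single game's possessions to a 48-minute basis. All inputs are hypothetical, and averaging the two teams' possession estimates is an assumed convention rather than something the source specifies.

```r
# Hypothetical single-game values (illustrative, not from league_avg)
team_poss <- 101   # estimated possessions for the team
opp_poss  <- 99    # estimated possessions for the opponent
team_min  <- 265   # total player-minutes: 5 players x 53 minutes (one overtime)

# Average the two possession estimates, then rescale to a 48-minute game
game_minutes <- team_min / 5
pace <- 48 * ((team_poss + opp_poss) / 2) / game_minutes
round(pace, 1)
```

The rescaling matters for overtime games: 100 possessions in 53 minutes is a slower tempo than 100 possessions in regulation.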
2.1 Time Series Visualization
Code
ts_pace <- ts(league_avg$Pace, start = 1980, frequency = 1)

df_pace <- data.frame(
  Year = league_avg$Season,
  Value = league_avg$Pace,
  Era = df_ortg$Era
)

ggplot(df_pace, aes(x = Year, y = Value, color = Era)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  geom_vline(xintercept = 2012, linetype = "dashed", color = "#f58426", size = 1) +
  scale_color_manual(values = c(
    "Pre-Analytics Era" = "#006bb6",
    "Analytics Era" = "#f58426",
    "Post-COVID Era" = "#bec0c2"
  )) +
  labs(
    title = "NBA Pace (1980-2025): Possessions Per 48 Minutes",
    subtitle = "U-shaped pattern: fast 1980s → slow 2000s → moderate recovery",
    x = "Season",
    y = "Pace (Possessions per 48 min)",
    color = "Era"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "bottom"
  )
Components: The series is U-shaped. Pace declines from roughly 102 to 90 possessions per 48 minutes between 1980 and 2005, then recovers to about 97 by 2025. Because the turning point comes well before 2012, the pace trend appears to be independent of the analytics trend.
2.2 Lag Plots
Code
gglagplot(ts_pace, do.lines = FALSE, lags = 9) +
  ggtitle("Lag Plot of Pace") +
  theme_minimal()
Interpretation: The U-shape produces complex lag patterns. Short lags show strong positive correlation, medium lags are weak because they pair observations from either side of the inflection point, and long lags separate into distinct 1980s and 2000s clusters.
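The weakening of the lag relationship on a U-shaped series can be reproduced with a small base R sketch. The series below is synthetic (a parabola tuned to start near 102, bottom out at 90 in 2005, and end near 97), and `lag_cor` is a helper name introduced only for this illustration.

```r
# Stylized U-shaped series standing in for pace (synthetic, not the real data)
year <- 1980:2025
pace_sim <- 90 + 12 * ((year - 2005) / 25)^2

# Correlation between the series and itself shifted by k years,
# i.e. the relationship each panel of a lag plot displays
lag_cor <- function(x, k) {
  n <- length(x)
  cor(x[(k + 1):n], x[1:(n - k)])
}

lag_cors <- sapply(1:9, function(k) lag_cor(pace_sim, k))
round(setNames(lag_cors, paste0("lag", 1:9)), 2)
```

At lag 1 the paired points sit on the same arm of the U, so they move together; at longer lags more pairs straddle the trough and the linear relationship weakens.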
2.3 ACF and PACF Analysis
Code
acf_pace <- ggAcf(ts_pace, lag.max = 20) +
  labs(title = "ACF of Pace") +
  theme_minimal()
pacf_pace <- ggPacf(ts_pace, lag.max = 20) +
  labs(title = "PACF of Pace") +
  theme_minimal()
acf_pace / pacf_pace
Interpretation: The ACF decays slowly, which indicates non-stationarity and reflects the U-shaped trajectory. The PACF shows a single spike at lag 1.
2.4 Stationarity Testing
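A minimal base R sketch of what "slow decay" looks like, and how first differencing removes it, using a synthetic random walk with drift rather than the pace data. The drift and noise settings are arbitrary assumptions.

```r
set.seed(123)  # reproducible synthetic example

# Random walk with drift: each step adds 0.5 plus small noise
x <- ts(cumsum(0.5 + rnorm(46, sd = 0.3)), start = 1980, frequency = 1)

# Sample autocorrelations at lags 1-10 (lag 0 dropped)
acf_levels <- acf(x, lag.max = 10, plot = FALSE)$acf[-1]
acf_diffs  <- acf(diff(x), lag.max = 10, plot = FALSE)$acf[-1]

round(rbind(levels = acf_levels, differenced = acf_diffs), 2)
```

In levels the autocorrelations start high and fall off gradually; after differencing they should hover near zero, which is the pattern expected of a stationary series.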
Code
adf_pace <- adf.test(ts_pace)
print(adf_pace)
Augmented Dickey-Fuller Test
data: ts_pace
Dickey-Fuller = -1.4007, Lag order = 3, p-value = 0.8116
alternative hypothesis: stationary
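With a p-value of 0.8116, the test fails to reject the unit-root null, so the pace series is treated as non-stationary in levels. For readers who want to see where the statistic comes from, the sketch below hand-rolls the regression behind an ADF test with a constant, a linear trend, and k lagged differences. It is an illustration on a synthetic random walk, not a replacement for `adf.test()`; the function name `adf_stat` is hypothetical, and the default lag rule `trunc((n - 1)^(1/3))` is assumed to match the one that produced "Lag order = 3" above.

```r
# Sketch of the ADF regression: regress the first difference on the lagged
# level, a constant, a linear trend, and k lagged differences
adf_stat <- function(x, k = trunc((length(x) - 1)^(1/3))) {
  x <- as.numeric(x)           # embed() needs a plain vector, not a ts object
  dx <- diff(x)
  n <- length(dx)
  z <- embed(dx, k + 1)        # columns: dx_t, dx_{t-1}, ..., dx_{t-k}
  d_now <- z[, 1]
  x_lag <- x[(k + 1):n]        # level one step before each dx_t
  trend <- (k + 1):n
  fit <- if (k > 0) {
    lm(d_now ~ x_lag + trend + z[, -1, drop = FALSE])
  } else {
    lm(d_now ~ x_lag + trend)
  }
  # The test statistic is the t-ratio on the lagged level
  unname(coef(summary(fit))["x_lag", "t value"])
}

set.seed(1)
random_walk <- cumsum(rnorm(46))   # synthetic non-stationary series
default_lag <- trunc((length(random_walk) - 1)^(1/3))
stat <- adf_stat(random_walk)
c(lag_order = default_lag, statistic = round(stat, 3))
```

More negative statistics are stronger evidence against a unit root. The statistic does not follow a standard t distribution, so it has to be compared against Dickey-Fuller critical values rather than the usual t tables.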
Code
diff_pace <- diff(ts_pace, differences = 1)

par(mfrow = c(2, 1))
plot(ts_pace, main = "Original Pace Series", ylab = "Pace", xlab = "Year")
plot(diff_pace, main = "First Differenced Pace Series", ylab = "Change in Pace", xlab = "Year")
Code
acf_diff_pace <- ggAcf(diff_pace, lag.max = 20) +
  labs(title = "ACF of First Differenced Pace") +
  theme_minimal()
pacf_diff_pace <- ggPacf(diff_pace, lag.max = 20) +
  labs(title = "PACF of First Differenced Pace") +
  theme_minimal()
acf_diff_pace / pacf_diff_pace
3PAr measures the percentage of field goal attempts that are three-pointers. Research question: Did the rise in 3-point volume precede or follow efficiency improvements?
3.1 Time Series Visualization
Code
ts_3par <- ts(league_avg$`3PAr`, start = 1980, frequency = 1)

df_3par <- data.frame(
  Year = league_avg$Season,
  Value = league_avg$`3PAr`,
  Era = df_ortg$Era
)

ggplot(df_3par, aes(x = Year, y = Value, color = Era)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  geom_vline(xintercept = 2012, linetype = "dashed", color = "#f58426", size = 1) +
  annotate("text",
    x = 2012, y = 0.44, label = "Analytics Era Begins",
    hjust = -0.05, color = "#f58426", fontface = "bold", size = 3.5
  ) +
  scale_color_manual(values = c(
    "Pre-Analytics Era" = "#006bb6",
    "Analytics Era" = "#f58426",
    "Post-COVID Era" = "#bec0c2"
  )) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "NBA 3-Point Attempt Rate (1980-2025)",
    subtitle = "Percentage of field goal attempts that are three-pointers",
    x = "Season",
    y = "3-Point Attempt Rate (3PAr)",
    color = "Era"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "bottom"
  )
Component Identification:
Trend: Strong upward trend with clear acceleration post-2012
Structural Break: 2012 marks inflection point from gradual to rapid growth
Pattern: Similar acceleration timing to ORtg (both post-2012)
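One simple way to quantify a change from gradual to rapid growth is a piecewise-linear regression with a hinge term at the suspected break year. The sketch below uses a synthetic series with made-up slopes; it shows the mechanics only and says nothing about the actual size of the 2012 break.

```r
# Synthetic 3PAr-like series: slow growth, then faster growth after 2012
# (slopes are made up for illustration, not estimated from league_avg)
year <- 1980:2025
hinge <- pmax(year - 2012, 0)      # 0 through 2012, years elapsed afterwards
rate <- 0.02 + 0.006 * (year - 1980) + 0.010 * hinge

# The hinge coefficient is the change in slope after the break year
fit <- lm(rate ~ I(year - 1980) + hinge)
round(coef(fit), 4)
```

A hinge coefficient clearly different from zero would support an acceleration at the chosen year. On real data the break year itself is a modeling choice and should be checked against alternatives.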
Augmented Dickey-Fuller Test
data: diff_3par
Dickey-Fuller = -3.5956, Lag order = 3, p-value = 0.04462
alternative hypothesis: stationary
3.4 Moving Average Smoothing for 3PAr
Code
# Calculate moving averages with different windows
ma_3par_3 <- ma(ts_3par, order = 3)   # 3-year window (short-term)
ma_3par_5 <- ma(ts_3par, order = 5)   # 5-year window (medium-term)
ma_3par_10 <- ma(ts_3par, order = 10) # 10-year window (long-term)

# Create comparison plot
autoplot(ts_3par, series = "Original") +
  autolayer(ma_3par_3, series = "MA(3)") +
  autolayer(ma_3par_5, series = "MA(5)") +
  autolayer(ma_3par_10, series = "MA(10)") +
  scale_color_manual(
    values = c(
      "Original" = "gray60",
      "MA(3)" = "#006bb6",
      "MA(5)" = "#f58426",
      "MA(10)" = "#000000"
    ),
    breaks = c("Original", "MA(3)", "MA(5)", "MA(10)")
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "3-Point Attempt Rate: Moving Average Smoothing Comparison",
    subtitle = "Analytics revolution's exponential growth pattern clearly visible",
    y = "3-Point Attempt Rate (3PAr)",
    x = "Season",
    color = "Series"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "bottom"
  )
Interpretation:
- MA(3): Preserves the 1995-1997 spike (rule change)
- MA(5): Exponential pattern is clear: gradual growth 1980-2012, rapid surge 2012-2025
- MA(10): Shows the 2012 inflection point where the slope doubles
- 2012 represents a fundamental acceleration, not temporary variance
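The window effects described above follow directly from how a centered moving average weights its neighbors. The base R sketch below applies three windows to a flat synthetic series with one spike. `centered_ma` is a hypothetical helper, and its half-weight treatment of even windows is assumed to correspond to the default centering in `ma()`, which is worth confirming against the forecast documentation.

```r
# Flat synthetic series with a one-period spike (illustrative values)
x <- ts(c(rep(10, 7), 20, rep(10, 7)), start = 1980, frequency = 1)

# Centered moving average; even windows use half weights at both ends
# (a 2 x m average) so the result stays aligned with the original years
centered_ma <- function(x, order) {
  w <- if (order %% 2 == 1) {
    rep(1 / order, order)
  } else {
    c(0.5, rep(1, order - 1), 0.5) / order
  }
  stats::filter(x, w, sides = 2)
}

ma3 <- centered_ma(x, 3)
ma5 <- centered_ma(x, 5)
ma10 <- centered_ma(x, 10)

# Peak of each smoothed series and number of endpoints lost to the window
data.frame(
  window = c(3, 5, 10),
  peak = c(max(ma3, na.rm = TRUE), max(ma5, na.rm = TRUE), max(ma10, na.rm = TRUE)),
  lost = c(sum(is.na(ma3)), sum(is.na(ma5)), sum(is.na(ma10)))
)
```

Wider windows spread a one-off spike across more years, which is why short windows preserve brief episodes while long windows leave only sustained shifts. They also drop more observations at each end of the series, a real cost when only a few decades of annual data are available.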
4. Attendance: COVID-19 Impact Analysis
The attendance series tracks total and average per-game attendance, providing a direct measure of the COVID-19 pandemic’s impact on live sports.
4.1 Time Series Visualization
Code
# Calculate league-wide attendance by season
attendance_data <- all_adv_data %>%
  group_by(Season) %>%
  summarise(
    Total_Attendance = sum(`Unnamed: 29_level_0_Attend.`, na.rm = TRUE),
    Avg_Attendance = mean(`Unnamed: 30_level_0_Attend./G`, na.rm = TRUE),
    .groups = "drop"
  )

# Create time series (focusing on modern era 1990-2025)
attendance_data <- attendance_data %>%
  filter(Season >= 1990)
ts_attendance <- ts(attendance_data$Total_Attendance, start = 1990, frequency = 1)
Code
df_attendance <- data.frame(
  Year = attendance_data$Season,
  Value = attendance_data$Total_Attendance / 1e6, # Convert to millions
  Era = case_when(
    attendance_data$Season < 2020 ~ "Pre-COVID",
    attendance_data$Season >= 2020 & attendance_data$Season < 2022 ~ "COVID Era",
    attendance_data$Season >= 2022 ~ "Post-COVID Recovery"
  )
)

ggplot(df_attendance, aes(x = Year, y = Value, color = Era)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  geom_vline(xintercept = 2020, linetype = "dashed", color = "red", size = 1) +
  annotate("text",
    x = 2020, y = 24, label = "COVID-19\nPandemic (2020)",
    hjust = -0.05, color = "red", fontface = "bold", size = 3.5
  ) +
  annotate("rect",
    xmin = 2020, xmax = 2021, ymin = 0, ymax = 25,
    alpha = 0.1, fill = "red"
  ) +
  scale_color_manual(values = c(
    "Pre-COVID" = "#006bb6",
    "COVID Era" = "#d62728",
    "Post-COVID Recovery" = "#2ca02c"
  )) +
  labs(
    title = "NBA Total Attendance (1990-2025): COVID-19 Disruption and Recovery",
    subtitle = "90% collapse in 2020-21 followed by gradual recovery",
    x = "Season",
    y = "Total Attendance (Millions)",
    color = "Era"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "bottom"
  )
Components:
- Trend: Stable at 21-22M (1990-2019)
- Shock: 90% collapse in 2020-21 (bubble + limited capacity)
- Recovery: To ~18M by 2025 (still 15% below pre-COVID)
- Type: Exogenous intervention, ideal for intervention analysis
4.2 Lag Plots
Code
gglagplot(ts_attendance, do.lines = FALSE, lags = 9) +
  ggtitle("Lag Plot of Total Attendance") +
  theme_minimal()
Interpretation: Pre-COVID seasons form a tight cluster, with the 2020-21 observations standing apart as outliers. The distinct separation points to a temporary shock rather than a new regime.
4.3 ACF and PACF Analysis
Code
acf_attendance <- ggAcf(ts_attendance, lag.max = 15) +
  labs(title = "ACF of Total Attendance") +
  theme_minimal()
pacf_attendance <- ggPacf(ts_attendance, lag.max = 15) +
  labs(title = "PACF of Total Attendance") +
  theme_minimal()
acf_attendance / pacf_attendance
Interpretation: Autocorrelation is high, reflecting pre-COVID stability, with a visible 2020-21 outlier effect. The series is well suited to intervention analysis.
4.4 Moving Average Smoothing for Attendance
Code
# Calculate moving averages
ma_attendance_3 <- ma(ts_attendance, order = 3)
ma_attendance_5 <- ma(ts_attendance, order = 5)

# Plot comparison
autoplot(ts_attendance, series = "Original") +
  autolayer(ma_attendance_3, series = "MA(3)") +
  autolayer(ma_attendance_5, series = "MA(5)") +
  scale_color_manual(
    values = c(
      "Original" = "gray60",
      "MA(3)" = "#006bb6",
      "MA(5)" = "#f58426"
    ),
    breaks = c("Original", "MA(3)", "MA(5)")
  ) +
  # ts_attendance holds raw totals, so rescale the axis labels to millions
  scale_y_continuous(labels = scales::label_number(scale = 1e-6)) +
  labs(
    title = "Attendance: Moving Average Smoothing (COVID Shock Visible)",
    subtitle = "Smoothing cannot remove the dramatic 2020-21 disruption",
    y = "Total Attendance (millions)",
    x = "Season",
    color = "Series"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "bottom"
  )
Interpretation: The COVID shock persists through every MA window because it is too extreme to smooth away. The series shows pre-COVID stability, a sharp drop, and then a partial recovery, which makes it a strong candidate for intervention modeling.
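As a rough picture of what intervention modeling could look like in its simplest form, the sketch below regresses a stylized attendance series on indicator variables for the shock window and the recovery period. The values are synthetic and chosen only to echo the magnitudes described in this section (a stable baseline, a 90% drop, a partial recovery). A fuller treatment would typically embed such indicators in an ARIMA model with regressors.

```r
# Stylized attendance in millions (synthetic values chosen to echo the
# magnitudes described in this section, not the real league totals)
season <- 1990:2025
attendance_m <- ifelse(season < 2020, 21.5,
                ifelse(season < 2022, 2.15, 18.3))

# Indicator variables for the shock window and the recovery period
covid <- as.integer(season >= 2020 & season < 2022)
recovery <- as.integer(season >= 2022)

# Intercept = pre-COVID baseline; each coefficient = shift from that baseline
fit <- lm(attendance_m ~ covid + recovery)
round(coef(fit), 2)
```

Modeling the shock explicitly keeps the pre-COVID dynamics from being distorted by two extreme observations, which is the practical motivation for treating 2020-21 as an intervention rather than ordinary noise.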
5. Financial Data: Sports Betting Stocks - Time Series with Seasonality
Sports betting stocks (DKNG, PENN, MGM, CZR) provide time series with seasonality (weekly data, frequency=52), which makes them suitable for a proper decompose() example. All four experienced COVID-era boom-bust-stabilization cycles tied to the online betting surge.
Interpretation: The ACF decays slowly, indicating non-stationarity. Stock prices behave like random walks with drift and require differencing.
5.7 PENN Detailed Analysis (Comparison to DKNG)
5.7.1 PENN Time Series Visualization
Code
autoplot(ts_penn) +
  annotate("rect", xmin = 2021.5, xmax = 2022, ymin = 0, ymax = 140, alpha = 0.1, fill = "red") +
  annotate("text", x = 2021.75, y = 130, label = "Peak Bubble\n(Barstool Hype)", color = "red", fontface = "bold", size = 3) +
  annotate("rect", xmin = 2023, xmax = 2023.5, ymin = 0, ymax = 140, alpha = 0.1, fill = "purple") +
  annotate("text", x = 2023.25, y = 10, label = "ESPN BET\nTransition", color = "purple", fontface = "bold", size = 3) +
  labs(
    title = "Penn Entertainment (PENN) Weekly Stock Price (2020-2024)",
    subtitle = "Extreme volatility: Barstool hype → ESPN BET transition struggle",
    x = "Year", y = "Avg Weekly Adj Close ($)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14))
Components: Trend (extreme boom from $30 to $136, then severe collapse to $5-20); Seasonality (weekly); highest volatility among all betting stocks. Multiplicative model.
PENN vs DKNG Comparison:
- PENN: 500% spike (hype-driven) → 85% collapse (execution risk)
- DKNG: 350% peak → stable recovery (fundamental growth)
- Key Difference: PENN’s Barstool→ESPN BET transition created operational chaos; DKNG maintained pure-play focus
Interpretation:
- Trend: Extreme spike (2021-22) → catastrophic decline (2022-24, falling below $10)
- Seasonal: Weekly patterns similar to DKNG but drowned out by volatility
- Random: Extreme variance (operational risk + market speculation)
- Multiplicative: Volatility explosion during boom, compression during bust
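For reference, a multiplicative decomposition of a weekly series can be sketched with base R's `decompose()`. The series below is synthetic (a rising trend multiplied by a repeating 52-week pattern) and stands in for a price series; it is not DKNG or PENN data.

```r
# Synthetic weekly price: rising trend times a repeating 52-week pattern
# (values are illustrative, not DKNG or PENN prices)
weeks <- 1:260                                   # five years of weekly data
trend <- 20 + 0.2 * weeks
seasonal <- 1 + 0.1 * sin(2 * pi * weeks / 52)
price <- ts(trend * seasonal, start = c(2020, 1), frequency = 52)

dec <- decompose(price, type = "multiplicative")

# Seasonal indices are ratios centred on 1 rather than dollar offsets
range(dec$figure)
mean(dec$figure)
```

Because the seasonal component is a ratio, the same index implies a larger dollar swing when the price level is high, which is the behavior a multiplicative model is meant to capture. `decompose()` needs at least two full periods, so a weekly series with frequency 52 requires more than two years of data.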
5.7.3 Moving Average Smoothing (PENN)
Code
ma_penn_4 <- ma(ts_penn, order = 4)
ma_penn_13 <- ma(ts_penn, order = 13)
ma_penn_52 <- ma(ts_penn, order = 52)

autoplot(ts_penn, series = "Original") +
  autolayer(ma_penn_4, series = "MA(4 weeks)") +
  autolayer(ma_penn_13, series = "MA(13 weeks)") +
  autolayer(ma_penn_52, series = "MA(52 weeks)") +
  scale_color_manual(
    values = c(
      "Original" = "gray60",
      "MA(4 weeks)" = "#006bb6",
      "MA(13 weeks)" = "#f58426",
      "MA(52 weeks)" = "#000000"
    ),
    breaks = c("Original", "MA(4 weeks)", "MA(13 weeks)", "MA(52 weeks)")
  ) +
  labs(
    title = "PENN Stock: Moving Average Smoothing Comparison",
    subtitle = "Even annual smoothing cannot hide the structural collapse",
    y = "Stock Price ($)", x = "Year", color = "Series"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "bottom"
  )
Interpretation:
- MA(4): Preserves all major spikes (extreme short-term volatility)
- MA(13): Clear boom-bust cycle, no stabilization
- MA(52): Reveals fundamental decline (peak→collapse, no floor established)
- Contrast with DKNG: DKNG’s MA(52) shows stabilization; PENN shows continued deterioration
5.7.4 ACF and Lag Plots (PENN)
Code
acf_penn <- ggAcf(ts_penn, lag.max = 52) +
  labs(title = "ACF of PENN Weekly Stock Price") +
  theme_minimal()
pacf_penn <- ggPacf(ts_penn, lag.max = 52) +
  labs(title = "PACF of PENN Weekly Stock Price") +
  theme_minimal()
acf_penn / pacf_penn
Interpretation: Slow decay (non-stationary) similar to DKNG. Both require differencing. However, PENN’s higher volatility may create challenges for ARIMA/SARIMA modeling.
5.8 Connection to NBA Analysis
DKNG illustrates: (1) seasonality with frequency>1, (2) multiplicative vs additive models, (3) COVID boom paralleling sports betting surge, (4) MA window effects across temporal scales, (5) successful pure-play strategy.
PENN contrasts: (1) same seasonality structure but extreme operational volatility, (2) demonstrates how strategic missteps (Barstool hype → ESPN BET transition) overwhelm market trends, (3) cautionary tale for analytics-driven optimization gone wrong.
7. Summary of EDA Findings
7.1 Stationarity Summary
| Series | Frequency | ADF (Original) | ADF (Differenced) | Conclusion |
| --- | --- | --- | --- | --- |
| ORtg | Annual (1) | Non-stationary | Stationary | Requires d=1 |
| Pace | Annual (1) | Non-stationary | Stationary | Requires d=1 |
| 3PAr | Annual (1) | Non-stationary | Stationary | Requires d=1 |
| Attendance | Annual (1) | Non-stationary | Stationary | Requires d=1 (COVID shock) |
| Betting Stocks (DKNG, PENN, MGM, CZR) | Weekly (52) | Non-stationary | Stationary | Requires d=1 (typical for prices) |
All series exhibit non-stationarity in levels but achieve stationarity after first differencing.
7.2 Decomposition Model Choices (Additive vs Multiplicative)
| Series | Model Type | Justification |
| --- | --- | --- |
| NBA Metrics (ORtg, Pace, 3PAr, Attendance) | Additive | Constant variance; absolute changes measured in points/percentages; no scaling of volatility with level |
| Betting Stocks (DKNG, PENN, MGM, CZR) | Multiplicative | Variance scales with price level; percentage changes more meaningful; heteroskedasticity present |
Optimal for identifying structural breaks (2012 analytics era)
| MA(10+) | Long-term smoothing | Reveals only fundamental regime changes |
Key Findings from MA Analysis:
- The 2012 structural break is visible across all window sizes for ORtg and 3PAr
- Pace’s U-shape becomes clearer with increased smoothing, confirming independence from the analytics trend
- The COVID shock in attendance persists even through MA(5), indicating extreme magnitude
- DKNG’s boom-bust is clearly visible in MA(13) quarterly smoothing
7.5 Lag Plot Interpretations
ORtg, 3PAr: Strong positive correlations at all lags → persistent trends, no mean reversion
Pace: More complex patterns due to U-shaped trajectory → different lag structures in different eras
Attendance: Pre-COVID cluster with outliers at 2020-21 → exogenous shock structure